{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 10.3 Apply：General split-apply-combine（应用：通用的分割-应用-合并）\n",
    "\n",
    ">general-purpose: 可以理解为通用，泛用。\n",
    "\n",
    ">例子：在计算机软件中，通用编程语言(General-purpose programming language )指被设计为各种应用领域服务的编程语言。通常通用编程语言不含有为特定应用领域设计的结构。\n",
    "\n",
    ">相对而言，特定域编程语言就是为某一个特定的领域或应用软件设计的编程语言。比如说，LaTeX就是专门为排版文献而设计的语言。\n",
    "\n",
    "最通用的GroupBy(分组)方法是apply，这也是本节的主题。如下图所示，apply会把对象分为多个部分，然后将函数应用到每一个部分上，然后把所有的部分都合并起来：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/81f9f.png)\n",
    "\n",
    "返回之前提到的tipping数据集，假设我们想要根据不同组（group），选择前5个tip_pct值最大的。首先，写一个函数，函数的功能为在特定的列，选出有最大值的行:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tips = pd.read_csv('../examples/tips.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Add tip percentage of total bill\n",
    "tips['tip_pct'] = tips['tip'] / tips['total_bill']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "      <th>tip_pct</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>16.99</td>\n",
       "      <td>1.01</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.059447</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10.34</td>\n",
       "      <td>1.66</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>3</td>\n",
       "      <td>0.160542</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>21.01</td>\n",
       "      <td>3.50</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>3</td>\n",
       "      <td>0.166587</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>23.68</td>\n",
       "      <td>3.31</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.139780</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>24.59</td>\n",
       "      <td>3.61</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>4</td>\n",
       "      <td>0.146808</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   total_bill   tip smoker  day    time  size   tip_pct\n",
       "0       16.99  1.01     No  Sun  Dinner     2  0.059447\n",
       "1       10.34  1.66     No  Sun  Dinner     3  0.160542\n",
       "2       21.01  3.50     No  Sun  Dinner     3  0.166587\n",
       "3       23.68  3.31     No  Sun  Dinner     2  0.139780\n",
       "4       24.59  3.61     No  Sun  Dinner     4  0.146808"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tips.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def top(df, n=5, column='tip_pct'):\n",
    "    return df.sort_values(by=column)[-n:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "      <th>tip_pct</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>109</th>\n",
       "      <td>14.31</td>\n",
       "      <td>4.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.279525</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>183</th>\n",
       "      <td>23.17</td>\n",
       "      <td>6.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>4</td>\n",
       "      <td>0.280535</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>232</th>\n",
       "      <td>11.61</td>\n",
       "      <td>3.39</td>\n",
       "      <td>No</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.291990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>67</th>\n",
       "      <td>3.07</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>1</td>\n",
       "      <td>0.325733</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>178</th>\n",
       "      <td>9.60</td>\n",
       "      <td>4.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.416667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>172</th>\n",
       "      <td>7.25</td>\n",
       "      <td>5.15</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.710345</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     total_bill   tip smoker  day    time  size   tip_pct\n",
       "109       14.31  4.00    Yes  Sat  Dinner     2  0.279525\n",
       "183       23.17  6.50    Yes  Sun  Dinner     4  0.280535\n",
       "232       11.61  3.39     No  Sat  Dinner     2  0.291990\n",
       "67         3.07  1.00    Yes  Sat  Dinner     1  0.325733\n",
       "178        9.60  4.00    Yes  Sun  Dinner     2  0.416667\n",
       "172        7.25  5.15    Yes  Sun  Dinner     2  0.710345"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "top(tips, n=6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，如果我们按smoker分组，然后用apply来使用这个函数，我们能得到下面的结果："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "      <th>tip_pct</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>smoker</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">No</th>\n",
       "      <th>88</th>\n",
       "      <td>24.71</td>\n",
       "      <td>5.85</td>\n",
       "      <td>No</td>\n",
       "      <td>Thur</td>\n",
       "      <td>Lunch</td>\n",
       "      <td>2</td>\n",
       "      <td>0.236746</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>185</th>\n",
       "      <td>20.69</td>\n",
       "      <td>5.00</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>5</td>\n",
       "      <td>0.241663</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>51</th>\n",
       "      <td>10.29</td>\n",
       "      <td>2.60</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.252672</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149</th>\n",
       "      <td>7.51</td>\n",
       "      <td>2.00</td>\n",
       "      <td>No</td>\n",
       "      <td>Thur</td>\n",
       "      <td>Lunch</td>\n",
       "      <td>2</td>\n",
       "      <td>0.266312</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>232</th>\n",
       "      <td>11.61</td>\n",
       "      <td>3.39</td>\n",
       "      <td>No</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.291990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">Yes</th>\n",
       "      <th>109</th>\n",
       "      <td>14.31</td>\n",
       "      <td>4.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.279525</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>183</th>\n",
       "      <td>23.17</td>\n",
       "      <td>6.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>4</td>\n",
       "      <td>0.280535</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>67</th>\n",
       "      <td>3.07</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>1</td>\n",
       "      <td>0.325733</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>178</th>\n",
       "      <td>9.60</td>\n",
       "      <td>4.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.416667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>172</th>\n",
       "      <td>7.25</td>\n",
       "      <td>5.15</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.710345</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            total_bill   tip smoker   day    time  size   tip_pct\n",
       "smoker                                                           \n",
       "No     88        24.71  5.85     No  Thur   Lunch     2  0.236746\n",
       "       185       20.69  5.00     No   Sun  Dinner     5  0.241663\n",
       "       51        10.29  2.60     No   Sun  Dinner     2  0.252672\n",
       "       149        7.51  2.00     No  Thur   Lunch     2  0.266312\n",
       "       232       11.61  3.39     No   Sat  Dinner     2  0.291990\n",
       "Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.279525\n",
       "       183       23.17  6.50    Yes   Sun  Dinner     4  0.280535\n",
       "       67         3.07  1.00    Yes   Sat  Dinner     1  0.325733\n",
       "       178        9.60  4.00    Yes   Sun  Dinner     2  0.416667\n",
       "       172        7.25  5.15    Yes   Sun  Dinner     2  0.710345"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tips.groupby('smoker').apply(top)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们来解释下上面这一行代码发生了什么。这里的top函数，在每一个DataFrame中的行组（row group）都被调用了一次，然后各自的结果通过pandas.concat合并了，最后用组名（group names）来标记每一部分。（译者：可以理解为，我们先按smoker这一列对整个DataFrame进行了分组，一共有No和Yes两组，然后对每一组上调用了top函数，所以每一组会返还5行作为结果，最后把两组的结果整合起来，一共是10行）。\n",
    "\n",
    "最后的结果是有多层级索引（hierarchical index）的，而且这个多层级索引的内部层级（inner level）含有来自于原来DataFrame中的索引值（index values）（译者：即在smoker为No的这一组，No本身是一个索引，它的内层索引是88, 185, 51, 149, 232这五个行索引，这五个内部层级是来自于原始DataFrame的）。\n",
    "\n",
    "如果传递一个函数给apply，可以在函数之后，设定其他一些参数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "      <th>tip_pct</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">No</th>\n",
       "      <th>Fri</th>\n",
       "      <th>94</th>\n",
       "      <td>22.75</td>\n",
       "      <td>3.25</td>\n",
       "      <td>No</td>\n",
       "      <td>Fri</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.142857</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sat</th>\n",
       "      <th>212</th>\n",
       "      <td>48.33</td>\n",
       "      <td>9.00</td>\n",
       "      <td>No</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>4</td>\n",
       "      <td>0.186220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sun</th>\n",
       "      <th>156</th>\n",
       "      <td>48.17</td>\n",
       "      <td>5.00</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>6</td>\n",
       "      <td>0.103799</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Thur</th>\n",
       "      <th>142</th>\n",
       "      <td>41.19</td>\n",
       "      <td>5.00</td>\n",
       "      <td>No</td>\n",
       "      <td>Thur</td>\n",
       "      <td>Lunch</td>\n",
       "      <td>5</td>\n",
       "      <td>0.121389</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">Yes</th>\n",
       "      <th>Fri</th>\n",
       "      <th>95</th>\n",
       "      <td>40.17</td>\n",
       "      <td>4.73</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Fri</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>4</td>\n",
       "      <td>0.117750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sat</th>\n",
       "      <th>170</th>\n",
       "      <td>50.81</td>\n",
       "      <td>10.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>3</td>\n",
       "      <td>0.196812</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sun</th>\n",
       "      <th>182</th>\n",
       "      <td>45.35</td>\n",
       "      <td>3.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>3</td>\n",
       "      <td>0.077178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Thur</th>\n",
       "      <th>197</th>\n",
       "      <td>43.11</td>\n",
       "      <td>5.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Thur</td>\n",
       "      <td>Lunch</td>\n",
       "      <td>4</td>\n",
       "      <td>0.115982</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 total_bill    tip smoker   day    time  size   tip_pct\n",
       "smoker day                                                             \n",
       "No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.142857\n",
       "       Sat  212       48.33   9.00     No   Sat  Dinner     4  0.186220\n",
       "       Sun  156       48.17   5.00     No   Sun  Dinner     6  0.103799\n",
       "       Thur 142       41.19   5.00     No  Thur   Lunch     5  0.121389\n",
       "Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.117750\n",
       "       Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.196812\n",
       "       Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.077178\n",
       "       Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.115982"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "除了上面这些基本用法，要想用好apply可能需要一点创新能力。毕竟传给这个函数的内容取决于我们自己，而最终的结果只需要返回一个pandas对象或一个标量。这一章的剩余部分主要介绍如何解决在使用groupby时遇到的一些问题。\n",
    "\n",
    "可以试一试在GroupBy对象上调用describe："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = tips.groupby('smoker')['tip_pct'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "smoker       \n",
       "No      count    151.000000\n",
       "        mean       0.159328\n",
       "        std        0.039910\n",
       "        min        0.056797\n",
       "        25%        0.136906\n",
       "        50%        0.155625\n",
       "        75%        0.185014\n",
       "        max        0.291990\n",
       "Yes     count     93.000000\n",
       "        mean       0.163196\n",
       "        std        0.085119\n",
       "        min        0.035638\n",
       "        25%        0.106771\n",
       "        50%        0.153846\n",
       "        75%        0.195059\n",
       "        max        0.710345\n",
       "Name: tip_pct, dtype: float64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>smoker</th>\n",
       "      <th>No</th>\n",
       "      <th>Yes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>151.000000</td>\n",
       "      <td>93.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.159328</td>\n",
       "      <td>0.163196</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.039910</td>\n",
       "      <td>0.085119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.056797</td>\n",
       "      <td>0.035638</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.136906</td>\n",
       "      <td>0.106771</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.155625</td>\n",
       "      <td>0.153846</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.185014</td>\n",
       "      <td>0.195059</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>0.291990</td>\n",
       "      <td>0.710345</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "smoker          No        Yes\n",
       "count   151.000000  93.000000\n",
       "mean      0.159328   0.163196\n",
       "std       0.039910   0.085119\n",
       "min       0.056797   0.035638\n",
       "25%       0.136906   0.106771\n",
       "50%       0.155625   0.153846\n",
       "75%       0.185014   0.195059\n",
       "max       0.291990   0.710345"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result.unstack('smoker')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在GroupBy内部，当我们想要调用一个像describe这样的函数的时候，其实相当于下面的写法：\n",
    "\n",
    "    f = lambda x: x.describe()\n",
    "    grouped.apply(f)\n",
    "    \n",
    "# 1 Suppressing the Group Keys（抑制组键）\n",
    "\n",
    "在接下来的例子，我们会看到作为结果的对象有一个多层级索引（hierarchical index），这个多层级索引是由原来的对象中，组键（group key）在每一部分的索引上得到的。我们可以在groupby函数中设置group_keys=False来关闭这个功能："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "      <th>tip_pct</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>88</th>\n",
       "      <td>24.71</td>\n",
       "      <td>5.85</td>\n",
       "      <td>No</td>\n",
       "      <td>Thur</td>\n",
       "      <td>Lunch</td>\n",
       "      <td>2</td>\n",
       "      <td>0.236746</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>185</th>\n",
       "      <td>20.69</td>\n",
       "      <td>5.00</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>5</td>\n",
       "      <td>0.241663</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>51</th>\n",
       "      <td>10.29</td>\n",
       "      <td>2.60</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.252672</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149</th>\n",
       "      <td>7.51</td>\n",
       "      <td>2.00</td>\n",
       "      <td>No</td>\n",
       "      <td>Thur</td>\n",
       "      <td>Lunch</td>\n",
       "      <td>2</td>\n",
       "      <td>0.266312</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>232</th>\n",
       "      <td>11.61</td>\n",
       "      <td>3.39</td>\n",
       "      <td>No</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.291990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>109</th>\n",
       "      <td>14.31</td>\n",
       "      <td>4.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.279525</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>183</th>\n",
       "      <td>23.17</td>\n",
       "      <td>6.50</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>4</td>\n",
       "      <td>0.280535</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>67</th>\n",
       "      <td>3.07</td>\n",
       "      <td>1.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>1</td>\n",
       "      <td>0.325733</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>178</th>\n",
       "      <td>9.60</td>\n",
       "      <td>4.00</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.416667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>172</th>\n",
       "      <td>7.25</td>\n",
       "      <td>5.15</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "      <td>0.710345</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     total_bill   tip smoker   day    time  size   tip_pct\n",
       "88        24.71  5.85     No  Thur   Lunch     2  0.236746\n",
       "185       20.69  5.00     No   Sun  Dinner     5  0.241663\n",
       "51        10.29  2.60     No   Sun  Dinner     2  0.252672\n",
       "149        7.51  2.00     No  Thur   Lunch     2  0.266312\n",
       "232       11.61  3.39     No   Sat  Dinner     2  0.291990\n",
       "109       14.31  4.00    Yes   Sat  Dinner     2  0.279525\n",
       "183       23.17  6.50    Yes   Sun  Dinner     4  0.280535\n",
       "67         3.07  1.00    Yes   Sat  Dinner     1  0.325733\n",
       "178        9.60  4.00    Yes   Sun  Dinner     2  0.416667\n",
       "172        7.25  5.15    Yes   Sun  Dinner     2  0.710345"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tips.groupby('smoker', group_keys=False).apply(top)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2 Quantile and Bucket Analysis（分位数与桶分析）\n",
    "\n",
    "在第八章中，我们介绍了pandas的一些工具，比如cut和qcut，通过设置中位数，切割数据为buckets with bins(有很多箱子的桶)。\n",
    "\n",
    "> 这里bucket我翻译为桶，可以理解为像group一样的概念，一个组内有不同的bins。而关于bins（箱）的部分，可以回顾看一下7.2\n",
    "\n",
    "把函数通过groupby整合起来，可以在做桶分析或分位数分析的时候更方便。假设一个简单的随机数据集和一个等长的桶类型（bucket categorization），使用cut："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>data1</th>\n",
       "      <th>data2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.723973</td>\n",
       "      <td>0.120216</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.053617</td>\n",
       "      <td>0.468000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.543073</td>\n",
       "      <td>-1.874073</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.915136</td>\n",
       "      <td>0.159179</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.775965</td>\n",
       "      <td>0.105447</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      data1     data2\n",
       "0  0.723973  0.120216\n",
       "1  2.053617  0.468000\n",
       "2 -0.543073 -1.874073\n",
       "3 -0.915136  0.159179\n",
       "4  0.775965  0.105447"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame = pd.DataFrame({'data1': np.random.randn(1000),\n",
    "                       'data2': np.random.randn(1000)})\n",
    "frame.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     (0.194, 1.795]\n",
       "1     (1.795, 3.395]\n",
       "2    (-1.407, 0.194]\n",
       "3    (-1.407, 0.194]\n",
       "4     (0.194, 1.795]\n",
       "5     (0.194, 1.795]\n",
       "6     (0.194, 1.795]\n",
       "7    (-1.407, 0.194]\n",
       "8    (-1.407, 0.194]\n",
       "9    (-1.407, 0.194]\n",
       "Name: data1, dtype: category\n",
       "Categories (4, object): [(-3.0139, -1.407] < (-1.407, 0.194] < (0.194, 1.795] < (1.795, 3.395]]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quartiles = pd.cut(frame.data1, 4)\n",
    "quartiles[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "cut返回的Categorical object（类别对象）能直接传入groupby。所以我们可以在data2列上计算很多统计值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def get_stats(group):\n",
    "    return {'min': group.min(), 'max': group.max(),\n",
    "            'count': group.count(), 'mean': group.mean()}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "grouped = frame.data2.groupby(quartiles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>max</th>\n",
       "      <th>mean</th>\n",
       "      <th>min</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>data1</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>(-3.0139, -1.407]</th>\n",
       "      <td>70.0</td>\n",
       "      <td>2.035166</td>\n",
       "      <td>0.113238</td>\n",
       "      <td>-2.363707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(-1.407, 0.194]</th>\n",
       "      <td>481.0</td>\n",
       "      <td>3.284688</td>\n",
       "      <td>-0.044535</td>\n",
       "      <td>-2.647341</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(0.194, 1.795]</th>\n",
       "      <td>407.0</td>\n",
       "      <td>2.402272</td>\n",
       "      <td>-0.043887</td>\n",
       "      <td>-2.898145</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>(1.795, 3.395]</th>\n",
       "      <td>42.0</td>\n",
       "      <td>2.051843</td>\n",
       "      <td>0.095178</td>\n",
       "      <td>-2.234979</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   count       max      mean       min\n",
       "data1                                                 \n",
       "(-3.0139, -1.407]   70.0  2.035166  0.113238 -2.363707\n",
       "(-1.407, 0.194]    481.0  3.284688 -0.044535 -2.647341\n",
       "(0.194, 1.795]     407.0  2.402272 -0.043887 -2.898145\n",
       "(1.795, 3.395]      42.0  2.051843  0.095178 -2.234979"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped.apply(get_stats).unstack()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "也有相同长度的桶（equal-length buckets）；想要按照样本的分位数得到相同长度的桶，用qcut。这里设定labels=False来得到分位数的数量："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Return quantile numbers\n",
    "grouping = pd.qcut(frame.data1, 10, labels=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "译者：上面的代码是把frame的data1列分为10个bin，每个bin都有相同的数量。因为一共有1000个样本，所以每个bin里有100个样本。grouping保存的是每个样本的index以及其对应的bin的编号。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "grouped = frame.data2.groupby(grouping)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>max</th>\n",
       "      <th>mean</th>\n",
       "      <th>min</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>data1</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100.0</td>\n",
       "      <td>2.178653</td>\n",
       "      <td>0.078390</td>\n",
       "      <td>-2.363707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>100.0</td>\n",
       "      <td>3.284688</td>\n",
       "      <td>-0.018699</td>\n",
       "      <td>-2.647341</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>100.0</td>\n",
       "      <td>2.214011</td>\n",
       "      <td>-0.066341</td>\n",
       "      <td>-2.262063</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>100.0</td>\n",
       "      <td>2.880188</td>\n",
       "      <td>-0.014041</td>\n",
       "      <td>-2.475753</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>100.0</td>\n",
       "      <td>2.741344</td>\n",
       "      <td>-0.007952</td>\n",
       "      <td>-2.576095</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>100.0</td>\n",
       "      <td>2.346857</td>\n",
       "      <td>-0.109602</td>\n",
       "      <td>-2.898145</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>100.0</td>\n",
       "      <td>2.402272</td>\n",
       "      <td>0.004522</td>\n",
       "      <td>-1.911955</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>100.0</td>\n",
       "      <td>2.351513</td>\n",
       "      <td>-0.161472</td>\n",
       "      <td>-2.640625</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>100.0</td>\n",
       "      <td>2.135995</td>\n",
       "      <td>-0.016079</td>\n",
       "      <td>-1.986676</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>100.0</td>\n",
       "      <td>2.051843</td>\n",
       "      <td>0.037685</td>\n",
       "      <td>-2.513164</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       count       max      mean       min\n",
       "data1                                     \n",
       "0      100.0  2.178653  0.078390 -2.363707\n",
       "1      100.0  3.284688 -0.018699 -2.647341\n",
       "2      100.0  2.214011 -0.066341 -2.262063\n",
       "3      100.0  2.880188 -0.014041 -2.475753\n",
       "4      100.0  2.741344 -0.007952 -2.576095\n",
       "5      100.0  2.346857 -0.109602 -2.898145\n",
       "6      100.0  2.402272  0.004522 -1.911955\n",
       "7      100.0  2.351513 -0.161472 -2.640625\n",
       "8      100.0  2.135995 -0.016079 -1.986676\n",
       "9      100.0  2.051843  0.037685 -2.513164"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped.apply(get_stats).unstack()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于pandas的Categorical类型，会在第十二章做详细介绍。\n",
    "\n",
    "# 3 Example: Filling Missing Values with Group-Specific Values（例子：用组特异性值来填充缺失值）\n",
    "\n",
    "在处理缺失值的时候，一些情况下我们会直接用dropna来把缺失值删除，但另一些情况下，我们希望用一些固定的值来代替缺失值，而fillna就是用来做这个的，例如，这里我们用平均值mean来代替缺失值NA："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "s = pd.Series(np.random.randn(6))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "s[::2] = np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         NaN\n",
       "1    0.878562\n",
       "2         NaN\n",
       "3   -0.264051\n",
       "4         NaN\n",
       "5    0.760488\n",
       "dtype: float64"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.458333\n",
       "1    0.878562\n",
       "2    0.458333\n",
       "3   -0.264051\n",
       "4    0.458333\n",
       "5    0.760488\n",
       "dtype: float64"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.fillna(s.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设我们想要给每一组填充不同的值。一个方法就是对数据分组后，用apply来调用fillna，在每一个组上执行一次。这里有一些样本是把美国各州分为西部和东部："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "states = ['Ohio', 'New York', 'Vermont', 'Florida',\n",
    "          'Oregon', 'Nevada', 'California', 'Idaho']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['East', 'East', 'East', 'East', 'West', 'West', 'West', 'West']"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "group_key = ['East'] * 4 + ['West'] * 4\n",
    "group_key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ohio          0.683283\n",
       "New York     -1.059896\n",
       "Vermont       0.105837\n",
       "Florida      -0.328586\n",
       "Oregon        1.973413\n",
       "Nevada        0.656673\n",
       "California    0.001700\n",
       "Idaho        -0.713295\n",
       "dtype: float64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.Series(np.random.randn(8), index=states)\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们令data中某些值为缺失值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ohio          0.683283\n",
       "New York     -1.059896\n",
       "Vermont            NaN\n",
       "Florida      -0.328586\n",
       "Oregon        1.973413\n",
       "Nevada             NaN\n",
       "California    0.001700\n",
       "Idaho              NaN\n",
       "dtype: float64"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[['Vermont', 'Nevada', 'Idaho']] = np.nan\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "East   -0.235066\n",
       "West    0.987556\n",
       "dtype: float64"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.groupby(group_key).mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后我们可以用每个组的平均值来填充NA："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "fill_mean = lambda g: g.fillna(g.mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ohio          0.683283\n",
       "New York     -1.059896\n",
       "Vermont      -0.235066\n",
       "Florida      -0.328586\n",
       "Oregon        1.973413\n",
       "Nevada        0.987556\n",
       "California    0.001700\n",
       "Idaho         0.987556\n",
       "dtype: float64"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.groupby(group_key).apply(fill_mean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在另外一些情况下，我们可能希望提前设定好用于不同组的填充值。因为group有一个name属性，我们可以利用这个："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "fill_values = {'East': 0.5, 'West': -1}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "fill_func = lambda g: g.fillna(fill_values[g.name])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ohio          0.683283\n",
       "New York     -1.059896\n",
       "Vermont       0.500000\n",
       "Florida      -0.328586\n",
       "Oregon        1.973413\n",
       "Nevada       -1.000000\n",
       "California    0.001700\n",
       "Idaho        -1.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.groupby(group_key).apply(fill_func)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4 Example: Random Sampling and Permutation（例子：随机抽样和排列）\n",
    "\n",
    "假设我们想要从一个很大的数据集里随机抽出一些样本，这里我们可以在Series上用sample方法。为了演示，这里县创建一副模拟的扑克牌："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Hearts红桃，Spades黑桃，Clubs梅花，Diamonds方片\n",
    "suits = ['H', 'S', 'C', 'D']\n",
    "card_val = (list(range(1, 11)) + [10] * 3) * 4\n",
    "base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']\n",
    "cards = []\n",
    "for suit in ['H', 'S', 'C', 'D']:\n",
    "    cards.extend(str(num) + suit for num in base_names)\n",
    "\n",
    "deck = pd.Series(card_val, index=cards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这样我们就得到了一个长度为52的Series，索引（index）部分是牌的名字，对应的值为牌的点数，这里的点数是按Blackjack（二十一点）的游戏规则来设定的。\n",
    "\n",
    "> Blackjack（二十一点）: 2点至10点的牌以牌面的点数计算，J、Q、K 每张为10点，A可记为1点或为11点。这里为了方便，我们只把A记为1点。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AH      1\n",
       "2H      2\n",
       "3H      3\n",
       "4H      4\n",
       "5H      5\n",
       "6H      6\n",
       "7H      7\n",
       "8H      8\n",
       "9H      9\n",
       "10H    10\n",
       "JH     10\n",
       "KH     10\n",
       "QH     10\n",
       "dtype: int64"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "deck[:13]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，就像我们上面说的，随机从牌组中抽出5张牌："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def draw(deck, n=5):\n",
    "    return deck.sample(n)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7H     7\n",
       "6D     6\n",
       "AC     1\n",
       "JH    10\n",
       "JS    10\n",
       "dtype: int64"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "draw(deck)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "假设我们想要从每副花色中随机抽取两张，花色是每张牌名字的最后一个字符（即H, S, C, D），我们可以根据花色分组，然后使用apply："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "get_suit = lambda card: card[-1] # last letter is suit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "C  QC    10\n",
       "   9C     9\n",
       "D  3D     3\n",
       "   JD    10\n",
       "H  KH    10\n",
       "   6H     6\n",
       "S  3S     3\n",
       "   7S     7\n",
       "dtype: int64"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "deck.groupby(get_suit).apply(draw, n=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另外一种写法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7C     7\n",
       "KC    10\n",
       "AD     1\n",
       "4D     4\n",
       "AH     1\n",
       "8H     8\n",
       "7S     7\n",
       "9S     9\n",
       "dtype: int64"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "deck.groupby(get_suit, group_keys=False).apply(draw, n=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5 Example: Group Weighted Average and Correlation（例子：组加权平均和相关性）\n",
    "\n",
    "在groupby的split-apply-combine机制下，DataFrame的两列或两个Series，计算组加权平均（Group Weighted Average）是可能的。这里举个例子，下面的数据集包含组键，值，以及权重："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>data</th>\n",
       "      <th>weights</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>0.098020</td>\n",
       "      <td>0.008455</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a</td>\n",
       "      <td>1.389496</td>\n",
       "      <td>0.826219</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a</td>\n",
       "      <td>0.202869</td>\n",
       "      <td>0.258955</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>a</td>\n",
       "      <td>-0.242403</td>\n",
       "      <td>0.470473</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>b</td>\n",
       "      <td>-0.820507</td>\n",
       "      <td>0.628758</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>b</td>\n",
       "      <td>0.866326</td>\n",
       "      <td>0.653632</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>b</td>\n",
       "      <td>-1.297375</td>\n",
       "      <td>0.639703</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>b</td>\n",
       "      <td>0.525019</td>\n",
       "      <td>0.012664</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  category      data   weights\n",
       "0        a  0.098020  0.008455\n",
       "1        a  1.389496  0.826219\n",
       "2        a  0.202869  0.258955\n",
       "3        a -0.242403  0.470473\n",
       "4        b -0.820507  0.628758\n",
       "5        b  0.866326  0.653632\n",
       "6        b -1.297375  0.639703\n",
       "7        b  0.525019  0.012664"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',\n",
    "                                'b', 'b', 'b', 'b'],\n",
    "                   'data': np.random.randn(8),\n",
    "                   'weights': np.random.rand(8)})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "按category分组来计算组加权平均："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "grouped = df.groupby('category')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "get_wavg = lambda g: np.average(g['data'], weights=g['weights'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "category\n",
       "a    0.695189\n",
       "b   -0.399497\n",
       "dtype: float64"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grouped.apply(get_wavg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另一个例子，考虑一个从Yahoo！财经上得到的经济数据集，包含一些股票交易日结束时的股价，以及S&P 500指数(即SPX符号)：\n",
    "\n",
    ">标准普尔500指数英文简写为S&P 500 Index，是记录美国500家上市公司的一个股票指数。这个股票指数由标准普尔公司创建并维护。\n",
    "\n",
    ">标准普尔500指数覆盖的所有公司，都是在美国主要交易所，如纽约证券交易所、Nasdaq交易的上市公司。与道琼斯指数相比，标准普尔500指数包含的公司更多，因此风险更为分散，能够反映更广泛的市场变化。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "close_px = pd.read_csv('../examples/stock_px_2.csv', parse_dates=True,\n",
    "                       index_col=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14\n",
      "Data columns (total 4 columns):\n",
      "AAPL    2214 non-null float64\n",
      "MSFT    2214 non-null float64\n",
      "XOM     2214 non-null float64\n",
      "SPX     2214 non-null float64\n",
      "dtypes: float64(4)\n",
      "memory usage: 86.5 KB\n"
     ]
    }
   ],
   "source": [
    "close_px.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AAPL</th>\n",
       "      <th>MSFT</th>\n",
       "      <th>XOM</th>\n",
       "      <th>SPX</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2011-10-11</th>\n",
       "      <td>400.29</td>\n",
       "      <td>27.00</td>\n",
       "      <td>76.27</td>\n",
       "      <td>1195.54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-10-12</th>\n",
       "      <td>402.19</td>\n",
       "      <td>26.96</td>\n",
       "      <td>77.16</td>\n",
       "      <td>1207.25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-10-13</th>\n",
       "      <td>408.43</td>\n",
       "      <td>27.18</td>\n",
       "      <td>76.37</td>\n",
       "      <td>1203.66</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011-10-14</th>\n",
       "      <td>422.00</td>\n",
       "      <td>27.27</td>\n",
       "      <td>78.11</td>\n",
       "      <td>1224.58</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              AAPL   MSFT    XOM      SPX\n",
       "2011-10-11  400.29  27.00  76.27  1195.54\n",
       "2011-10-12  402.19  26.96  77.16  1207.25\n",
       "2011-10-13  408.43  27.18  76.37  1203.66\n",
       "2011-10-14  422.00  27.27  78.11  1224.58"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "close_px[-4:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一个比较有意思的尝试是计算一个DataFrame，包括与SPX这一列逐年日收益的相关性（计算百分比变化）。一个可能的方法是，我们先创建一个能计算不同列相关性的函数，然后拿每一列与SPX这一列求相关性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "spx_corr = lambda x: x.corrwith(x['SPX'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后我们通过pct_change在close_px上计算百分比的变化："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "rets = close_px.pct_change().dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后，我们按年来给这些百分比变化分组，年份可以从每行的标签中通过一个一行函数提取，然后返回的结果中，用datetime标签来表示年份："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "get_year = lambda x: x.year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "by_year = rets.groupby(get_year)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>AAPL</th>\n",
       "      <th>MSFT</th>\n",
       "      <th>XOM</th>\n",
       "      <th>SPX</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2003</th>\n",
       "      <td>0.541124</td>\n",
       "      <td>0.745174</td>\n",
       "      <td>0.661265</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2004</th>\n",
       "      <td>0.374283</td>\n",
       "      <td>0.588531</td>\n",
       "      <td>0.557742</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2005</th>\n",
       "      <td>0.467540</td>\n",
       "      <td>0.562374</td>\n",
       "      <td>0.631010</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2006</th>\n",
       "      <td>0.428267</td>\n",
       "      <td>0.406126</td>\n",
       "      <td>0.518514</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007</th>\n",
       "      <td>0.508118</td>\n",
       "      <td>0.658770</td>\n",
       "      <td>0.786264</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2008</th>\n",
       "      <td>0.681434</td>\n",
       "      <td>0.804626</td>\n",
       "      <td>0.828303</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009</th>\n",
       "      <td>0.707103</td>\n",
       "      <td>0.654902</td>\n",
       "      <td>0.797921</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010</th>\n",
       "      <td>0.710105</td>\n",
       "      <td>0.730118</td>\n",
       "      <td>0.839057</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011</th>\n",
       "      <td>0.691931</td>\n",
       "      <td>0.800996</td>\n",
       "      <td>0.859975</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          AAPL      MSFT       XOM  SPX\n",
       "2003  0.541124  0.745174  0.661265  1.0\n",
       "2004  0.374283  0.588531  0.557742  1.0\n",
       "2005  0.467540  0.562374  0.631010  1.0\n",
       "2006  0.428267  0.406126  0.518514  1.0\n",
       "2007  0.508118  0.658770  0.786264  1.0\n",
       "2008  0.681434  0.804626  0.828303  1.0\n",
       "2009  0.707103  0.654902  0.797921  1.0\n",
       "2010  0.710105  0.730118  0.839057  1.0\n",
       "2011  0.691931  0.800996  0.859975  1.0"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "by_year.apply(spx_corr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们也可以计算列内的相关性。这里我们计算苹果和微软每年的相关性："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2003    0.480868\n",
       "2004    0.259024\n",
       "2005    0.300093\n",
       "2006    0.161735\n",
       "2007    0.417738\n",
       "2008    0.611901\n",
       "2009    0.432738\n",
       "2010    0.571946\n",
       "2011    0.581987\n",
       "dtype: float64"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6 Example: Group-Wise Linear Regression（例子：组对组的线性回归）\n",
    "\n",
    "就像上面介绍的例子，使用groupby可以用于更复杂的组对组统计分析，只要函数能返回一个pandas对象或标量。例如，我们可以定义regress函数（利用statsmodels库），在每一个数据块（each chunk of data）上进行普通最小平方回归（ordinary least squares (OLS) regression）计算："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def regress(data, yvar, xvars):\n",
    "    Y = data[yvar]\n",
    "    X = data[xvars]\n",
    "    X['intercept'] = 1\n",
    "    result = sm.OLS(Y, X).fit()\n",
    "    return result.params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，按年用苹果AAPL在标普SPX上做线性回归："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SPX</th>\n",
       "      <th>intercept</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2003</th>\n",
       "      <td>1.195406</td>\n",
       "      <td>0.000710</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2004</th>\n",
       "      <td>1.363463</td>\n",
       "      <td>0.004201</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2005</th>\n",
       "      <td>1.766415</td>\n",
       "      <td>0.003246</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2006</th>\n",
       "      <td>1.645496</td>\n",
       "      <td>0.000080</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007</th>\n",
       "      <td>1.198761</td>\n",
       "      <td>0.003438</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2008</th>\n",
       "      <td>0.968016</td>\n",
       "      <td>-0.001110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009</th>\n",
       "      <td>0.879103</td>\n",
       "      <td>0.002954</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010</th>\n",
       "      <td>1.052608</td>\n",
       "      <td>0.001261</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2011</th>\n",
       "      <td>0.806605</td>\n",
       "      <td>0.001514</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           SPX  intercept\n",
       "2003  1.195406   0.000710\n",
       "2004  1.363463   0.004201\n",
       "2005  1.766415   0.003246\n",
       "2006  1.645496   0.000080\n",
       "2007  1.198761   0.003438\n",
       "2008  0.968016  -0.001110\n",
       "2009  0.879103   0.002954\n",
       "2010  1.052608   0.001261\n",
       "2011  0.806605   0.001514"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "by_year.apply(regress, 'AAPL', ['SPX'])"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [py35]",
   "language": "python",
   "name": "Python [py35]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
